Abstract

This paper generalizes the traditional statistical concept of prediction
intervals to arbitrary probability density functions in high-dimensional
feature spaces by introducing significance level distributions, which
provide interval-independent probabilities for continuous random variables.
The advantage of transforming a probability density function into a
significance level distribution is that it enables one-class classification
or outlier detection in a direct manner.



1 Introduction 



A prediction interval is an interval that will, with a specified degree of
confidence, contain future realizations or, in the terminology of pattern
recognition, feature vectors (Hahn and Meeker, 1991). The appeal of this
concept is its clear stochastic meaning. The great disadvantage is that this
definition is usually too restrictive, for example for multimodal
distributions. It is intuitively clear that, in this case, more than one
interval of probable feature vectors can exist and it would be better to
speak of prediction regions. The situation is even more complicated for
high-dimensional feature spaces. This lack of generality is probably the
reason why prediction intervals are rarely used in pattern recognition.

This is a pity, because prediction regions would be very useful, for
example, for the recognition of outliers (Barnett and Lewis, 1994) or the
detection of novelty or normality. Instead of prediction intervals, numerous
other methods are used for this purpose. They can be grouped roughly into
two categories:



1. Distance-based and novelty or normality score-based approaches (Knorr
et al., 2000; Dolia, 2002; Moonesinghe and Tan, 2006; Angiulli et al., 2006;
Ting et al., 2007).

2. Methods that introduce a separate rejection class in combination with
a classifier (Singh and Markou, 2004; Steinwart et al., 2005).



When applied to outlier detection, the method I propose here belongs to the
first category, with a probability as the normality score. Before going into
the details, I will give a short overview of related work.

Simple distance-based methods rely on the concept of the neighborhood of a
point, for example, the k nearest neighbors (Knorr et al., 2000). Outliers
are those points for which there are fewer than k points within a distance
δ in the dataset. Ramaswamy et al. (2000) propose a method to choose this
threshold δ automatically based upon a dataset. The idea is to consider as
outliers the set of points with the highest distances to their kth nearest
neighbors. Of course, a threshold is still necessary here, but it now has a
statistical justification as a quantile of the kth nearest neighbor distance
distribution, which simplifies the choice. A more recent article based on
this idea was published by Angiulli et al. (2006), who apply a weighted sum
of the kth nearest distances per point. The idea is quite simple, and the
methods have low computation costs. Furthermore, they make only minor
assumptions about the underlying distribution.
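
To make the distance-based idea concrete, the following is a minimal sketch
in the spirit of the kth nearest neighbor approach, not a reproduction of any
of the cited algorithms. The function names, the brute-force distance
computation, and the choice of the 95% quantile as threshold are assumptions
made for illustration.

```python
import numpy as np


def kth_nn_distances(points: np.ndarray, k: int) -> np.ndarray:
    """Distance from every point to its k-th nearest neighbor (brute force)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dists, axis=1)[:, k]


def knn_outliers(points: np.ndarray, k: int = 5, quantile: float = 0.95) -> np.ndarray:
    """Boolean mask: True where the k-th neighbor distance exceeds the quantile."""
    d_k = kth_nn_distances(points, k)
    threshold = np.quantile(d_k, quantile)
    return d_k > threshold


# Hypothetical data: a Gaussian cloud plus one point far away from it.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))
far_point = np.array([[8.0, 8.0]])
data = np.vstack([inliers, far_point])
mask = knn_outliers(data, k=5, quantile=0.95)
```

The brute-force distance matrix needs memory quadratic in the number of
points, so this sketch is only suitable for small datasets.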

Another category of algorithms related to outlier detection is robust
regression. Here, outlier detection is more a means to an end, because the
goal is to prevent outliers from influencing the estimation of the
regression function. This means that in this case it is sufficient to detect
outliers indirectly. Ting et al. (2007), for example, apply an outlier score
to control the influence that a point has in the parameter estimation
process of the regression function. For this purpose, weights are
introduced, which are estimated based on the assumption that the noise is
Gaussian distributed. This is often sufficient, for example for most sensor
signals. The algorithm is real-time capable, but far from general. This
method also belongs to category one.

The idea of the second category is very different. At first glance, it
seems impossible to use classifiers to detect outliers, because classifiers
need samples from both inliers and outliers for the estimation of their
parameters. Usually, only samples of inliers are available. The idea is to
synthetically create an enclosing cloud of outlier samples with a random
generator. Afterwards, it is possible to train the classifier. Singh and
Markou (2004), for example, apply a neural network for this purpose. Other
classifiers are also possible, for example, an SVM (Steinwart et al., 2005).
Regardless of the applied classifier, these methods need, for the generation
of the hull, a measure of the degree to which a generated sample point is an
outlier. Singh and Markou (2004), for example, use simple prediction
intervals (2.5σ ranges) for this purpose.

In conclusion, both categories have to solve the same problem: to find an 
appropriate zero level set for the inlier generating density. In the subsequent 
sections I will show that this problem can be mapped to a choice of a sig- 
nificance level and that it is possible to generalize the traditional statistical 
concept of prediction intervals to prediction regions. 



2 Prediction Regions 

A prediction interval denotes a region in which future feature vectors x
occur with a predetermined probability. For its computation, it is essential
that the generating probability density function $p_X(x)$ for the random
variable X is known. In this case

$$P(x_1 < X < x_2) = \int_{x_1}^{x_2} p_X(\omega)\, d\omega \qquad (1)$$

is the probability for a future feature vector within the region $[x_1, x_2]$.

For a prediction interval, the region borders have to be established so
that the probability for outliers is lower than a given, fixed threshold:
the significance level $\alpha$. A typical example is $\alpha = 5\%$, which
means that the region is to be determined so that 95% of all possible
feature vectors fall into it. Usually, an infinite number of borders fulfill
this requirement. Therefore, for Gaussian distributions, the region is
additionally centered on the expectation value. But this definition is only
appropriate for this special case.

The reason for this is the fact that the integration borders are defined
directly. A way to solve this problem is to apply the level set idea with
the probability density function $p_X(x)$ as level set function. The set

$$\Gamma_\theta = \{x \mid p_X(x) - \theta = 0\} \qquad (2)$$

can serve as an implicit definition for the integration borders. The
question is how the threshold $\theta$ corresponds to the significance level
$\alpha$ so that

$$\alpha = \int_{p_X(\omega) < \theta} p_X(\omega)\, d\omega \qquad (3)$$

becomes true.

Let $F_Y$ be the cumulative distribution function of the probability density
function values $y = p_X(x)$ and $F_Y^{-1}$ its inverse. In this case, the
threshold $\theta$ can be computed from a given significance level $\alpha$
by

$$\theta = F_Y^{-1}(\alpha). \qquad (4)$$

I now prove this assertion.



Proof 1 The feature vectors x can be interpreted as realizations of the
vectorial random variable X. Because of this, $Y = p_X(X)$ is also a random
variable. But contrary to X, Y is scalar valued. The relation between X
and Y is strictly deterministic, that is,

$$p_{Y|X}(y \mid x) = \delta(y - p_X(x)) \qquad (5)$$

and consequently

$$p_{Y,X}(y, x) = \delta(y - p_X(x))\, p_X(x). \qquad (6)$$

The marginalization with respect to X finally gives a formula to convert the
probability density function $p_X$ into the probability density function
$p_Y$:

$$p_Y(y) = \int \delta(y - p_X(\omega))\, p_X(\omega)\, d\omega. \qquad (7)$$

Now, we can calculate the cumulative distribution function $F_Y(y)$, which
is defined by

$$F_Y(y) = \int_{-\infty}^{y} p_Y(y')\, dy'. \qquad (8)$$

Inserting (7) in (8) yields

$$F_Y(y) = \int_{-\infty}^{y} \int \delta(y' - p_X(\omega))\, p_X(\omega)\, d\omega\, dy'. \qquad (9)$$

With the interval function

$$\Pi_a^b(x) = \begin{cases} 1, & a < x < b \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

the expression can be transformed to

$$F_Y(y) = \int \Pi_{-\infty}^{y}(p_X(\omega))\, p_X(\omega)\, d\omega = \int_{p_X(\omega) < y} p_X(\omega)\, d\omega. \qquad (11)$$

A comparison shows that the right side of expression (3) is identical to
$F_Y(\theta)$. For this reason, we can write (3) as

$$\alpha = F_Y(\theta). \qquad (12)$$

The cumulative distribution function $F_Y$ is by definition monotonic. If it
is even strictly monotonic, the inverse function $F_Y^{-1}$ exists and we
can compute $\theta$ for a given $\alpha$ by expression (4). Otherwise, we
could obtain for one value of $\alpha$ an interval of possible values for
$\theta$. Because all values in this interval result in the same $\alpha$
for the integration (3), any value in this interval can be used to solve the
equation.



Summary: A significance level $\alpha$, as it is usually applied in
statistics, can be transformed with the scalar valued function $F_Y^{-1}$
into a level set threshold $\theta$. With this threshold, those feature
vectors whose probability density function values $p_X(x)$ are lower than
$\theta$ can be classified as outliers with a statistical significance of
$1 - \alpha$. The ranges for which the condition $p_X(x) > \theta$ is
fulfilled are the prediction regions.
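
As an illustration of why regions rather than intervals are needed, the
following sketch computes the level set threshold for a one-dimensional
bimodal density by brute-force integration on a grid. The mixture density,
the grid, and all function names are assumptions chosen for this example;
the grid-based integration only stands in for the integral in (3) and is not
the estimation method proposed in this paper.

```python
import numpy as np


def mixture_pdf(x: np.ndarray) -> np.ndarray:
    """Hypothetical bimodal density: equal mix of N(-3, 1) and N(3, 1)."""
    norm = 1.0 / np.sqrt(2.0 * np.pi)
    left = np.exp(-0.5 * (x + 3.0) ** 2)
    right = np.exp(-0.5 * (x - 3.0) ** 2)
    return 0.5 * norm * (left + right)


def level_set_threshold(pdf_values: np.ndarray, cell_width: float, alpha: float) -> float:
    """Grid approximation of theta with integral of p over {p < theta} close to alpha."""
    sorted_p = np.sort(pdf_values)
    # Probability mass accumulated over grid cells, least likely cells first.
    cum_mass = np.cumsum(sorted_p) * cell_width
    idx = int(np.searchsorted(cum_mass, alpha))
    return float(sorted_p[idx])


grid = np.linspace(-12.0, 12.0, 24001)
dx = grid[1] - grid[0]
p = mixture_pdf(grid)
theta = level_set_threshold(p, dx, alpha=0.05)

inside = p >= theta                        # grid cells in the prediction regions
outlier_mass = float(p[~inside].sum() * dx)
rising_edges = np.diff(inside.astype(int)) == 1
n_regions = int(rising_edges.sum())        # number of disjoint prediction regions
```

For this density the low-density valley between the two modes falls below
the threshold, so the 95% prediction region consists of two disjoint
intervals, one around each mode.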



3 Significance Level Distributions 



It has been shown that the cumulative distribution function $F_Y(y)$ can be
used to transform a given significance level $\alpha$ into a level set
threshold $\theta$. But, because of expression (11), $1 - F_Y(y)$ can also
be applied as a measure to describe our degree of surprise about a certain
feature vector x. We summarize and define

$$b_X(x) = \int_{p_X(\omega) < p_X(x)} p_X(\omega)\, d\omega = F_Y(p_X(x)) \qquad (13)$$



and name $b_X(x)$ the significance level distribution of x for the random
variable X.

The significance level distribution is in the true sense of the word a
"probability distribution" because it provides a probability (the
significance level) for every continuous realization x. Unfortunately, the
term "probability distribution" is already used for probability density
functions, which do not provide probabilities but probability density
values. Note that the significance level distribution does not deliver the
probability for a single realization x itself, but the probability for all
realizations that are even more unlikely than x. Nevertheless, $b_X(x)$
provides valuable information for the assessment of the realization x and
allows one to decide whether it is sure, probable, or only possible.

Figure 1: For unimodal and symmetric distributions the significance level
distribution (C) and the prediction interval (A) lead to the same results.
But for multimodal distributions only the significance level distribution is
reasonable (D).



For simple standard distributions, such as the Gaussian distribution or the
Cauchy distribution, the significance level distribution can be given in
closed form. Note that for a symmetric and unimodal distribution the
significance level distribution and the prediction interval are identical
(see Fig. 1). For more complex distributions this is usually not valid, and
it is seldom possible to give the significance level distribution in closed
form. In these cases it is reasonable to estimate the cumulative
distribution function $F_Y$. Section 4 proposes a method and investigates
its convergence speed. Figure 2 shows an example of a significance level
distribution for a non-trivial probability density function. Please note
that significance level distributions are not restricted to the
one-dimensional case.



Figure 2: An example of a probability density function $p_X(x)$ and its
related significance level distribution. The white zones are the "outlier
regions" for a significance level of 5%. The threshold $\theta = 0.00326$
corresponds to $\alpha = 5\%$.



4 The Estimation of $F_Y$

Contrary to the complicated integration (3), the expression (12) can be
easily computed if we estimate $F_Y$. We assume that the probability density
function $p_X(x)$ is known or can be appropriately estimated. With the
knowledge of $p_X(x)$, it is possible to generate n correspondingly
distributed random samples:

$$D_X = \{x_1, \ldots, x_n\}. \qquad (14)$$

Now we can transform this dataset into a dataset of probability density
function values

$$D_Y = \{p_X(x_1), \ldots, p_X(x_n)\} = \{y_1, \ldots, y_n\}. \qquad (15)$$

With this dataset $D_Y$, the cumulative distribution function $F_Y$ can be
estimated by

$$\hat{F}_Y(y) = \frac{1}{n} \sum_{k=1}^{n} \Theta(y - y_k) \qquad (16)$$

with the Heaviside function

$$\Theta(x) = \begin{cases} 1, & x \ge 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (17)$$

The Glivenko-Cantelli theorem guarantees the convergence of this empirical
distribution function $\hat{F}_Y(y)$ to the true cumulative distribution
function $F_Y(y)$ for $n \to \infty$. Note that it is unnecessary to sum
over all elements $y_k$ for the computation of expression (16) if the
dataset $D_Y$ is sorted. In this case, a binary search with computation
costs of $O(\log n)$ can be applied.
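
The sorted-array evaluation can be sketched as follows. This is a minimal
illustration using Python's standard `bisect` module; the helper name
`make_ecdf` and the toy values are assumptions, and the convention that a
sample value equal to y is counted is assumed as well.

```python
import bisect


def make_ecdf(density_values):
    """Return the empirical distribution function of the sample as a callable."""
    sorted_values = sorted(density_values)  # sort once: O(n log n)
    n = len(sorted_values)

    def ecdf(y: float) -> float:
        # bisect_right returns the number of sample values y_k with y_k <= y,
        # found by binary search in O(log n) per query.
        return bisect.bisect_right(sorted_values, y) / n

    return ecdf


# Toy dataset of density values, assumed for illustration only.
ecdf = make_ecdf([0.30, 0.10, 0.40, 0.20])
```

With these four values, `ecdf(0.25)` evaluates to 0.5 because two of the
four sample values are at most 0.25.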

It is possible to give the root mean squared error of this estimator as a
function of the size n of the dataset:

$$\mathrm{RMSE} = \sqrt{\frac{F_Y (1 - F_Y)}{n}} \le \frac{1}{\sqrt{4n}}. \qquad (18)$$

This formula makes it possible to calculate the number of samples n
necessary for a desired accuracy with a given significance level
$\alpha = F_Y$. It is important that the convergence speed depends neither
on the generating density $p_X(x)$ nor on the dimension of the original
problem (3). A proof of expression (18) follows.



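Proof 2 We calculate the mean squared error MSE by computing the expectation
value

$$E\{(\hat{F}_Y - F_Y)^2\} = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} (\hat{F}_Y - F_Y)^2\, p_Y(y_1)\, dy_1 \cdots p_Y(y_n)\, dy_n \qquad (19)$$

of the squared error with respect to all elements in the dataset (15). The
MSE is usually written as the sum

$$\mathrm{MSE} = \mathrm{BIAS}^2 + \mathrm{VAR} \qquad (20)$$

with

$$\mathrm{BIAS} = E\{\hat{F}_Y\} - F_Y \qquad (21)$$

and

$$\mathrm{VAR} = E\{\hat{F}_Y^2\} - E\{\hat{F}_Y\}^2. \qquad (22)$$

For the estimation formula (16), we obtain

$$\begin{aligned}
E\{\hat{F}_Y\} &= \frac{1}{n} \sum_{k=1}^{n} E\{\Theta(y - y_k)\} \\
&= \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} \Theta(y - y_k)\, p_Y(y_1) \cdots p_Y(y_n)\, dy_1 \cdots dy_n \\
&= \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{+\infty} \Theta(y - y_k)\, p_Y(y_k)\, dy_k \\
&= \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{y} p_Y(y_k)\, dy_k = \frac{1}{n} \sum_{k=1}^{n} F_Y = F_Y.
\end{aligned} \qquad (23)$$

This shows that the estimator (16) is unbiased. Furthermore, we have

$$\begin{aligned}
E\{\hat{F}_Y^2\} &= \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j=1}^{n} E\{\Theta(y - y_k)\, \Theta(y - y_j)\} \\
&= \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j \ne k} E\{\Theta(y - y_k)\}\, E\{\Theta(y - y_j)\} + \frac{1}{n^2} \sum_{k=1}^{n} E\{\Theta(y - y_k)^2\}.
\end{aligned} \qquad (24)$$

Because of $\Theta(y - y_k)^2 = \Theta(y - y_k)$, we obtain

$$E\{\hat{F}_Y^2\} = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j \ne k} F_Y^2 + \frac{1}{n^2} \sum_{k=1}^{n} F_Y = \frac{n - 1}{n} F_Y^2 + \frac{1}{n} F_Y. \qquad (25)$$

Inserting (25) and (23) into (22) yields

$$\mathrm{VAR} = \mathrm{MSE} = \frac{1}{n} F_Y (1 - F_Y). \qquad (26)$$

Finally, because of $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$, we obtain
expression (18).



5 Overview of the method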



Estimation

1. If the inlier generating density $p_X(x)$ is unknown, estimate it with a
suitable algorithm.

2. Choose an upper bound for the RMSE and calculate the necessary number n
of random samples to generate: $n = 1/(2\,\mathrm{RMSE})^2$.

3. Generate the random samples $D_X = \{x_1, \ldots, x_n\}$.

4. Compute the derived dataset $D_Y = p_X(D_X)$.

5. Sort $D_Y$.

Application

1. Choose a significance level $\alpha$.

2. Compute the density value $y = p_X(x)$ for the feature vector x of
interest.

3. Calculate the significance level value $z = b_X(x)$ by computing
$z = \hat{F}_Y(y)$.

4. Classify x as an outlier if $z < \alpha$.
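
The estimation and application steps can be sketched end to end as follows.
This is a minimal illustration, not a reference implementation: a standard
normal density stands in for the known or previously estimated density, the
sample count is rounded up to an integer, and all function names are
assumptions.

```python
import bisect
import math
import random
from statistics import NormalDist


def required_samples(rmse_bound: float) -> int:
    """Estimation step 2: n = 1 / (2 RMSE)^2, rounded up to an integer."""
    return math.ceil(1.0 / (2.0 * rmse_bound) ** 2)


def fit_density_values(pdf, sampler, n: int) -> list:
    """Estimation steps 3 to 5: sample, map through the density, sort."""
    return sorted(pdf(sampler()) for _ in range(n))


def significance_level(sorted_values, y: float) -> float:
    """Application step 3: empirical F_Y evaluated at the density value y."""
    return bisect.bisect_right(sorted_values, y) / len(sorted_values)


def is_outlier(pdf, sorted_values, x: float, alpha: float) -> bool:
    """Application steps 2 to 4 for one feature vector x."""
    return significance_level(sorted_values, pdf(x)) < alpha


# Stand-in for the inlier generating density (an assumption of this sketch).
dist = NormalDist(mu=0.0, sigma=1.0)
rng = random.Random(0)

n = required_samples(rmse_bound=0.005)
density_values = fit_density_values(dist.pdf, lambda: rng.gauss(0.0, 1.0), n)

alpha = 0.05
far_is_outlier = is_outlier(dist.pdf, density_values, 4.0, alpha)
center_is_outlier = is_outlier(dist.pdf, density_values, 0.0, alpha)
```

Only the density evaluation depends on the dimension of x; once the sorted
density values are available, each query costs one density evaluation and
one binary search.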



6 Experimental Validation of Equation (18) 



Usually, the inlier generating density $p_X(x)$ is unknown and has to be
estimated. For this reason, the quality of the proposed method for outlier
detection depends significantly on the quality of the applied density
estimation algorithm. The influence of the $F_Y$ estimation is, however,
marginal, because the RMSE (18) can be reduced arbitrarily, in contrast to
the estimation of $p_X(x)$, by increasing the random sample number n. The
following experiment verifies this by comparing a theoretically determined
significance level distribution $b_X$ to an estimated version $\hat{b}_X$.

Only for some simple distributions is it possible to give the significance
level distribution in closed form. One such example is the Gaussian
distribution

$$p_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}. \qquad (27)$$

The significance level distribution has, in this case, the form

$$b_X(x) = 1 - \operatorname{erf}\!\left(\frac{|x|}{\sqrt{2}\,\sigma}\right). \qquad (28)$$
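
The closed form can be retraced in a few lines. The following is a sketch of
the intermediate steps for the zero-mean Gaussian density with standard
deviation $\sigma$, using only the definition of the significance level
distribution as the probability mass of all points with lower density than x.

```latex
% The zero-mean Gaussian density depends on x only through |x| and is
% strictly decreasing in |x|, so the set of points with lower density
% than x is exactly the set of points farther from the mean than x.
\begin{align*}
b_X(x) &= \int_{p_X(\omega) < p_X(x)} p_X(\omega)\, d\omega
        = P\bigl(|X| > |x|\bigr) \\
       &= 1 - \int_{-|x|}^{|x|} \frac{1}{\sqrt{2\pi}\,\sigma}\,
          e^{-\omega^2 / (2\sigma^2)}\, d\omega
        = 1 - \operatorname{erf}\!\left(\frac{|x|}{\sqrt{2}\,\sigma}\right).
\end{align*}
% Plausibility check: |x| = 1.96\,\sigma gives b_X(x) \approx 0.05.
```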



In an experiment, I have estimated the significance level distribution for
a standard normal distribution with the proposed method while varying the
random sample number n. For each value of n, I have averaged the squared
differences between the closed form (28) and the estimated versions over
2000 single estimations. The results are summarized in Fig. 3, which shows
that the experiment confirms the theoretical predictions.
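
A scaled-down version of such an experiment can be sketched as follows. The
evaluation point, the sample size, the reduced number of 200 repetitions,
and the function names are assumptions made to keep the example fast; this
is not the code used to produce the reported results.

```python
import bisect
import math
import random
from statistics import NormalDist


def exact_b(x: float) -> float:
    """Closed-form significance level distribution of the standard normal."""
    return 1.0 - math.erf(abs(x) / math.sqrt(2.0))


def estimate_b(x: float, sorted_values, pdf) -> float:
    """Empirical F_Y evaluated at the density value of x."""
    return bisect.bisect_right(sorted_values, pdf(x)) / len(sorted_values)


def empirical_rmse(x: float, n: int, repetitions: int, rng: random.Random) -> float:
    """Root of the averaged squared error over independent estimations."""
    dist = NormalDist()
    total = 0.0
    for _ in range(repetitions):
        values = sorted(dist.pdf(rng.gauss(0.0, 1.0)) for _ in range(n))
        total += (estimate_b(x, values, dist.pdf) - exact_b(x)) ** 2
    return math.sqrt(total / repetitions)


rng = random.Random(1)
x0, n = 1.0, 1000
observed = empirical_rmse(x0, n, repetitions=200, rng=rng)
# Theoretical error sqrt(F_Y (1 - F_Y) / n) with F_Y = b_X(x0).
predicted = math.sqrt(exact_b(x0) * (1.0 - exact_b(x0)) / n)
upper_bound = 1.0 / math.sqrt(4.0 * n)
```

With only 200 repetitions the observed value scatters around the predicted
one by a few percent; more repetitions tighten the agreement.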



[Figure 3 panels: n = 100, n = 1000, n = 10000, n = 100000]

Figure 3: The figure shows the averaged root mean squared errors of 2000
single experiments for the significance level distribution estimation of a
standard normal distribution for different random sample numbers n (black
lines) in comparison to the theoretical errors (thick white lines in the
background).



7 Conclusions 

In this article, I have shown that it is always possible to compute
prediction regions as a generalization of prediction intervals, regardless
of whether the generating density is high-dimensional or multimodal. Only
the density has to be known or estimated.

The idea was to define the integration borders indirectly by a zero level
set with the probability density function as level set function. This led
to the problem of transforming the significance level defining a prediction
interval into a level set threshold. I have shown that this can be easily
accomplished by the cumulative distribution function of the probability
density function values. The advantage is that the complicated integration
in the high-dimensional feature space is mapped to a one-dimensional
function evaluation.

Furthermore, I have introduced a new probability measure, the significance
level distribution, which can be easily derived from the probability density
function. The advantage is that it enables the assessment of the
"plausibility" of a realization or feature vector because it also provides
probabilities for continuous realizations. The transformation procedure has
low computation costs, and the estimation error of the method can be made
negligible by increasing the number of random samples.

Please note that in practice the performance of the proposed method for
one-class classification tasks depends significantly on the quality of the
applied density estimation method, just like the quality of a Bayes
classifier for multi-class classification. Conversely, for an optimally
estimated density, the method would necessarily be optimal for one-class
classification, just as a Bayes classifier is optimal for the multi-class
classification case. Because density estimation is not the topic of this
article, I have deliberately omitted experimental comparisons with other
outlier recognition or one-class classification methods.